JEPArdy! Non-Record Submission - JEPA + Leader-Stack - val_bpb 1.1230 #1243
simon-marcus wants to merge 1 commit into openai:main
This record captures our JEPA-focused non-record submission and the slightly improbable chain of decisions that produced it.
## Results (8x H100 SXM)

| Seed | val_bpb (best) | val_bpb (final) | Export size (bytes) |
| --- | --- | --- | --- |
| 1337 | 1.12128254 | 1.12271348 | 15,922,511 |
| 42 | 1.12176502 | 1.12337480 | 15,910,983 |
| 2026 | 1.12157046 | 1.12300861 | 15,809,375 |

The headline val_bpb of 1.1230 is the three-seed mean of the final column. It beats the March 17 Naive Baseline (val_bpb: 1.22436570) by a wide margin, and it does so with an ablation saga that still lets us say, with a reasonably straight face, that JEPA itself is doing useful work here rather than merely riding shotgun inside a stronger stack.

## Why We Believe JEPA Is Doing Real Work
This submission reflects the way we ended up thinking about JEPA after a week of arguing with it: not as a total replacement for the rest of the stack, but as a component that has to earn its place inside a coherent recipe.
I had not built with JEPA before and was skeptical that it would contribute anything useful to this model family. So the method was deliberately antagonistic. JEPA had to win in a cleaner setting before it was allowed into the stronger stack, and then it had to win again there under longer-horizon validation.
The first proof came in a byte-level isolation lane built against matched controls on 8x H100. There, JEPA gave a real if modest improvement:

- control (no JEPA): 1.36692884
- `JEPA_LOSS_WEIGHT=0.10`: 1.36182961

After adding full-model EMA, JEPA still won:

- control (EMA, no JEPA): 1.36487466
- `JEPA_LOSS_WEIGHT=0.10` (with EMA): 1.36169919

Only after that did we move into the stronger leader-family stack with `TTT_ENABLED=0`, where the question changed from "does JEPA help at all?" to "how much JEPA can this recipe absorb before it becomes self-defeating?" The short screens made several settings look plausible. The longer 600s runs were less easily charmed:

- baseline (no JEPA): 2.22466503
- weight 0.05: 2.22707166
- weight 0.15: 2.29532844
- weight 0.10: 2.19728869

That back-and-forth was the point. A weaker weight looked harmless without helping much. A stronger weight looked briefly promising and then turned into a liability. 0.10 was the Goldilocks setting that survived contact with the actual budget, so that is the one we promoted to the final 8x H100 candidate.

This hybrid "symphony" of multiple instruments tuned and retuned to each other is what we should have expected to see, based on Monsieur LeCun's original JEPA paper -- though of course LeCun put it in terms of the modular structure of the mammalian brain. More recent JEPA work makes the same point from the engineering side: the 2026 LeWorldModel paper says existing JEPA methods often depend on multi-term losses, EMA, pre-trained encoders, or auxiliary supervision to stay stable; that was the inspiration for our initial decision to add EMA to both the isolation lane and the leader-stack translation lane.
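Since full-model EMA shows up in both the isolation lane and the leader stack, here is a minimal sketch of the update rule (plain Python over a parameter dict; the decay value and variable names are illustrative, not taken from the submission's `train_gpt.py`):

```python
def ema_update(ema_params, model_params, decay=0.999):
    """Exponential moving average over a full parameter set, in place:
    ema <- decay * ema + (1 - decay) * model."""
    for name, value in model_params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value

# Toy demo: the EMA copy drifts toward the live model at rate (1 - decay).
ema = {"w": 0.0}
model = {"w": 1.0}
for _ in range(3):
    ema_update(ema, model, decay=0.9)
# After 3 steps, ema["w"] is 1 - 0.9**3 = 0.271 (up to float rounding).
```

In a real run the EMA copy serves as the slow-moving target network; only the online model receives gradient updates.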
All this to say: modern JEPA stacks are usually hybrids of one sort or another. The interesting question is not whether JEPA does all the work by itself, but whether it changes the behavior of the whole system in a fruitful way that survives ablation and scale-up, and that's our claim here.
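To make the "JEPA earns its place inside the recipe" framing concrete: the objective in a hybrid like this is a weighted sum, with the LM loss carrying the melody and the JEPA latent-prediction term as accompaniment. A minimal sketch, assuming an MSE latent loss against a detached EMA target (function and variable names are ours, not from `train_gpt.py`):

```python
def latent_mse(pred, target):
    """JEPA-style latent prediction loss: online predictor output vs. a
    stop-gradient / EMA target representation (flat lists here for clarity)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def combined_loss(lm_loss, online_pred, ema_target, jepa_weight=0.10):
    # jepa_weight=0.10 is the setting this submission promoted to the final run.
    return lm_loss + jepa_weight * latent_mse(online_pred, ema_target)

loss = combined_loss(1.36, [0.5, 0.5], [0.4, 0.6])
```

Because the auxiliary term is scaled down by the weight, a bad JEPA signal degrades the objective gracefully rather than catastrophically, which is exactly what the weight sweep above was probing.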
## What We Tried And What Failed
This JEPA result did not arrive as a single clean idea. It came out of several dead ends, some misleading short screens, and a wise decision to keep explicit logs. In the hope that our failed experiments might prevent fellow travelers from tripping over the same rocks, we're including further details here, and I'd be happy to expand on any of them if asked.
Key failed or inconclusive directions:

- weight 0.15: looked best at 180s, then lost badly at 600s

Key things that helped:
- `LeakyReLU(0.5)^2`
- choosing `JEPA_LOSS_WEIGHT=0.10` by longer-horizon validation, not by the cheapest screens alone

Actually, really interestingly, we homed in on 0.10 after a pretty simple "0.05 too little JEPA, 0.15 too much JEPA" screen, where Claude and I concluded "ah, clearly 0.10 is the sweet spot -- the ideal amount of JEPA -- right in the middle." But then I said to Claude, "do we have any strong antecedent theoretical reason to think the distribution of JEPA losses just flattens out beyond 0.15? Why not suppose that it changes shape multiple times in the interval up to 1?" After giving me some side-eye, Claude relented and we did a sweep at every 0.05 interval, results below, indicating that the distribution is indeed not monotonic, and that I was "Absolutely Right™".
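The every-0.05 sweep that settled the argument is, mechanically, just a grid over the weight axis. A hypothetical harness (the `run_and_eval` callback is our stand-in for whatever launches a timed run and returns val_bpb):

```python
def sweep_jepa_weights(run_and_eval, step=0.05, upper=1.0):
    """Evaluate every weight in (0, upper] at `step` intervals; return
    (weight, val_bpb) pairs sorted best-first (lowest bpb wins)."""
    weights = [round(step * i, 2) for i in range(1, int(round(upper / step)) + 1)]
    return sorted(((w, run_and_eval(w)) for w in weights), key=lambda wv: wv[1])

# Toy, non-monotonic stand-in curve with its dip at 0.10 -- purely illustrative,
# not the submission's actual loss landscape.
ranking = sweep_jepa_weights(lambda w: (w - 0.10) ** 2 + 0.05 * (w > 0.5))
best_weight = ranking[0][0]
```

The point of sweeping the whole interval rather than three points is exactly the one argued above: a non-monotonic curve can have structure that a "low / middle / high" screen never sees.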
## Final Candidate Snapshot
Training model:

- `train_gpt.py` included in this folder
- `JEPA_LOSS_WEIGHT=0.10`
- `fineweb10B_sp1024` + `fineweb_1024_bpe.model`
- `TTT_ENABLED=0`

Storage-only export pass:
- `train_gpt.py`
- strips the JEPA heads (`jepa_in.weight`, `jepa_out.weight`) from the exported state
- stores the `attn`, `mlp`, `embed`, and `other` floating tensors with the int6 path

## Legality and Compliance
- `train_gpt.py` in this folder is the self-contained submission script.
- The exported model is under the 16,000,000 byte limit.
- Training fits the 10 minute budget on 8x H100.
- `TTT_ENABLED=0`.
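For completeness, the storage-only export pass described in the snapshot above can be sketched as follows. This is a deliberately simplified stand-in (plain lists instead of real tensors, per-tensor symmetric scaling into the 6-bit signed range [-31, 31]); the actual int6 path in `train_gpt.py` is surely more involved:

```python
JEPA_HEAD_KEYS = ("jepa_in.weight", "jepa_out.weight")  # dropped at export time

def export_state(state, qmax=31):
    """Strip the JEPA heads, then quantize every remaining tensor to
    6-bit signed ints with one float scale per tensor."""
    out = {}
    for name, tensor in state.items():
        if name in JEPA_HEAD_KEYS:
            continue  # JEPA is a training-time aid; it costs nothing at export.
        scale = (max(abs(x) for x in tensor) / qmax) or 1.0  # avoid div-by-zero
        out[name] = {"scale": scale, "q": [round(x / scale) for x in tensor]}
    return out

packed = export_state({"jepa_in.weight": [0.1], "attn.w": [0.5, -1.0, 0.25]})
```

Dequantization is `scale * q` per element; the storage win comes from shipping small ints plus one scale instead of full floats, which is how the exported model stays under the byte limit.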